Marketing Research

MKTG 440

Sampling

Calendar

| Mon | Tue | Wed | Thu | Fri | Sat | Sun |
|---|---|---|---|---|---|---|
| 2/16 Sampling | 2/17 | 2/18 No Class | 2/19 | 2/20 Thinking Forward | 2/21 | 2/22 GA2 due |
| 2/23 Exam Review | 2/24 | 2/25 Exam 1 | 2/26 | 2/27 | 2/28 | 3/1 GA3 due |
| 3/2 Technology Day | 3/3 | 3/4 No Class | 3/5 | 3/6 | 3/7 | 3/8 |
| 3/9 No Class | 3/10 | 3/11 No Class | 3/12 | 3/13 | 3/14 | 3/15 |

This week’s topics

  • Exam 1 prep
  • Why we sample
  • Sampling frames
  • Types of errors
  • Probability sampling
  • Non-probability sampling
  • Why sampling works

Exam 1 prep

Practice exam

  • Last year’s exam 1
    • D2L → Content → Resources
    • Separate file for solutions (don’t deprive yourself of the opportunity to practice on your own first!)
      • For SA, other solutions may be possible (provided solutions give you a sense of what a good answer looks like, how much you need to write – generally not much)
  • Actual exam
    • Instructions will be identical to practice exam
    • Format will be very similar (MC and SA)

How I would prepare

  • Review lecture slides and your notes
    • Pay attention to examples, embedded practice Qs
  • Take the practice exam
    • Check your answers against the solutions
  • Identify areas of weakness and review those topics again
  • LLMs can be useful for practice and review
    • Explain concepts, invent new questions, etc.
    • Be cautious using models not trained on our course materials (e.g., ChatGPT)
    • I recommend NotebookLM

Why do we sample?

The sampling problem

  • Goal: Make claims about a certain group of people or objects (“population”)

  • Problem: For cost/time reasons it is usually impractical to collect data on everyone

    • A census involves a complete enumeration of the elements of a population
    • US census (2020) collects 9 pieces of information on 300 million+ people, which is estimated to cost $15.6B (plus many years of planning)
  • Solution: select a subset of the population, i.e. “sample”

    • Generalize what we learn from the sample to the population
    • For the data to be useful, we have to be careful in how we choose the subset of the population in our sample

Population versus sample

Population: The entire group of individuals we are interested in.

  • Examples: all humans, all Tucson residents, all UA students, etc.

  • A parameter is a number describing a characteristic of the population (e.g., population mean)

Sample: The subset of the population we collect data from.

  • A statistic is a number describing a characteristic of a sample (e.g., sample mean)

  • How well the sample represents the population depends on the sampling design

Figure: a population, described by the parameter \(\mu\), and a sample drawn from it, described by the statistic \(\bar{x}\).
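The distinction between a parameter and a statistic can be sketched in a few lines of Python. This is only an illustration: the population of monthly spending values is invented, and in practice the parameter is unknown, which is why we sample.

```python
import random
import statistics

rng = random.Random(440)  # fixed seed so the sketch is reproducible

# Hypothetical population: monthly spending (in dollars) for 10,000 customers.
population = [rng.gauss(100, 25) for _ in range(10_000)]

# Parameter: computed from every element of the population (normally unknown).
mu = statistics.fmean(population)

# Statistic: computed only from the sample we actually collect.
sample = rng.sample(population, 200)
x_bar = statistics.fmean(sample)

print(f"population mean (parameter): {mu:.2f}")
print(f"sample mean (statistic):     {x_bar:.2f}")
```

The two numbers will generally be close but not identical; the gap between them is the sampling error.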

Sampling design process

  1. Define the target population
  2. Determine the sampling frame
  3. Select a sampling technique
  4. Determine the sample size
  5. Implement the sampling process (collect the data)

Sampling frame

What is a sampling frame?

  • A sampling frame is, ideally, a list of every person/object in the target population.
    • Can be expensive and time-consuming to create this list
  • Where can we find sampling frames?
    • customer/membership list (for existing relationships)
    • registered voter list (for an election)
    • telephone book (for households with landline)
  • It is difficult to be exhaustive (complete)
    • Hard-to-reach groups may not make the list (e.g., undocumented immigrants, people experiencing homelessness, people who value their physical/digital privacy)

Example: UA students

  • Suppose we were interested in surveying current students at UA

  • What is a sampling frame we could use?

  • One possibility:

    • All individuals with a @arizona.edu email
  • What problems might we run into?

Sampling frame error

Figure: target population (students at UA) vs. sampling frame (all @arizona.edu emails), with the mismatch labeled as sampling frame error.

  • Sampling frame error is any mismatch between the sampling frame and the target population
    • Can be due to undercoverage or overcoverage
    • One type of non-sampling error
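Undercoverage and overcoverage are set differences between the target population and the sampling frame. The sketch below uses a handful of made-up IDs (not real people or real email holders) to show the two kinds of mismatch.

```python
# Hypothetical IDs, invented for illustration only.
target_population = {"ana", "ben", "chen", "dee", "eli"}     # current UA students
sampling_frame = {"ana", "ben", "chen", "prof_f", "alum_g"}  # @arizona.edu email holders

# Undercoverage: in the target population but missing from the frame.
undercoverage = target_population - sampling_frame

# Overcoverage: in the frame but not part of the target population.
overcoverage = sampling_frame - target_population

print("undercoverage:", sorted(undercoverage))
print("overcoverage:", sorted(overcoverage))
```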

Sampling frame error can lead to bias

  • Suppose you want to describe the average income of influencers on a particular platform.

  • How might we generate a sampling frame for this target population?

  • How could that sampling frame lead to a biased estimate of income?

Classifying error

Sampling vs. non-sampling error

Sampling error refers to differences between a sample statistic and a population parameter that result from using a sample instead of a census.

  • Bad news: sampling error is unavoidable
    • Even with random sampling, different samples drawn from the same population will differ from one another
    • The sample mean will (probably) not be exactly the same as population mean
  • Good news: we can quantify the sampling error using statistics
    • Confidence intervals, p-values, and standard errors
    • The bigger the sample size, the smaller the sampling error
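One common way to quantify sampling error for a sample mean is the standard error, \(\sigma / \sqrt{n}\). The sketch below assumes simple random sampling and an illustrative population standard deviation of 2.5; the function name is our own.

```python
import math

def standard_error(sigma: float, n: int) -> float:
    """Standard error of the sample mean under simple random sampling."""
    return sigma / math.sqrt(n)

sigma = 2.5  # assumed population standard deviation (illustrative)
for n in (25, 100, 400):
    print(f"n = {n:>3}: standard error = {standard_error(sigma, n):.3f}")
```

Because of the square root, quadrupling the sample size only halves the standard error.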

Sampling vs. non-sampling error

Non-sampling error is error caused by our sampling method or data collection process.

  • Can lead to bias, for example if the sampling frame is systematically different from the target population

May be due to…

  • Sampling frame error: sampling frame does not represent the target population
  • Poor questionnaire design and delivery
    • Non-response error: respondents choose not to participate in the survey
      • Can mitigate with incentives
    • Response error: respondents provide incorrect information
      • Because they can’t remember, because it’s sensitive, etc.
      • Can mitigate by designing better surveys

Compounding errors

Figure: population → sampling frame → response.

Sampling procedures

Probability vs. non-probability sampling

Probability

  • Every element in the target population has a known, non-zero probability of inclusion in the sample
  • Allows quantification of sampling error

Non-probability

  • Population elements are selected in a non-random manner
  • Usually based on researcher’s judgement

Sampling techniques overview

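| Type | Technique | Description |
|---|---|---|
| Probability sampling | Simple random | Each element has a known and equal probability |
| Probability sampling | Systematic | Every kth element is selected |
| Probability sampling | Stratified | Random sample of elements from each stratum |
| Probability sampling | Cluster | Random sample of clusters is selected |
| Non-probability sampling | Convenience | Sample based on convenience |
| Non-probability sampling | Judgmental | Sample based on judgment |
| Non-probability sampling | Quota | Two-stage restricted judgmental sampling |
| Non-probability sampling | Snowball | Subsequent selection based on referrals |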

Probability sampling

Simple random sampling (SRS)

Each element of the population has the same, known, non-zero probability of inclusion.

Mechanics

  • Uses computer-generated random numbers

Advantages

  • Simple
  • Favorable statistical properties (easy to quantify sampling error)

Disadvantages

  • Requires a complete and accurate list of target population (sampling frame)
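A minimal sketch of drawing an SRS with computer-generated random numbers, assuming the sampling frame is available as a list. The customer IDs are hypothetical.

```python
import random

rng = random.Random(440)  # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: one ID per element of the target population.
sampling_frame = [f"customer_{i:04d}" for i in range(1, 1001)]

# random.sample draws without replacement, so every element has the same
# probability of inclusion (here 50 / 1000) and nobody is selected twice.
srs = rng.sample(sampling_frame, k=50)

print(srs[:5])
```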

Systematic sampling

Include every k-th element from the population list.

Mechanics

  • Choose a random starting point, then select every kth element
  • Skip interval k = population size / sample size

Advantages

  • Can be more efficient than SRS
  • Useful when the sampling frame is not digitized
  • Useful for objectives like quality control

Disadvantages

  • Hidden patterns can cause bias (e.g., if k aligns with a cyclical pattern)
  • Requires knowledge of the population size
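The mechanics above (random start, then every kth element) can be sketched as follows. The frame and the function name are illustrative assumptions, and the skip interval is rounded down when the population size is not a multiple of the sample size.

```python
import random

def systematic_sample(frame, n, rng):
    """Pick a random start in the first interval, then take every k-th element."""
    k = len(frame) // n       # skip interval = population size / sample size
    start = rng.randrange(k)  # random starting point: 0, 1, ..., k - 1
    return frame[start::k][:n]

rng = random.Random(440)
sampling_frame = [f"unit_{i:04d}" for i in range(1000)]  # hypothetical ordered list
systematic = systematic_sample(sampling_frame, n=50, rng=rng)

print(systematic[:3])
```

If the list order follows a cycle whose length matches k (for example, every 20th unit comes off the same machine), the sample would over-represent one part of the cycle, which is the hidden-pattern risk noted above.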

Stratified sampling

Divide the population into strata, then randomly sample from each stratum.

Mechanics

  • Elements within a stratum should be homogeneous (similar); elements across strata should be heterogeneous (different)
  • Strata based on characteristics (e.g., age, major, income level, etc.)

Figure: a population divided into Stratum 1, Stratum 2, and Stratum 3.

Advantages

  • Ensures important (or small) subgroups are represented
  • Can be more representative than SRS (lower sampling error when elements within each stratum are similar)

Disadvantages

  • Can be hard to identify the right basis for stratifying
  • Difficult to implement with too many strata
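A minimal sketch of stratified sampling: group the frame by the stratifying variable, then take an SRS inside each stratum. The student IDs, the choice of class year as the stratifying variable, and the equal allocation of 25 per stratum are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(frame, n_per_stratum, rng):
    """Take an SRS of n_per_stratum elements within each stratum."""
    strata = defaultdict(list)
    for element, stratum in frame:
        strata[stratum].append(element)
    return {s: rng.sample(members, n_per_stratum) for s, members in strata.items()}

rng = random.Random(440)
years = ["freshman", "sophomore", "junior", "senior"]
# Hypothetical frame of (student_id, class_year) pairs.
frame = [(f"s{i:04d}", years[i % 4]) for i in range(1000)]

stratified = stratified_sample(frame, n_per_stratum=25, rng=rng)
for stratum, members in stratified.items():
    print(stratum, len(members))
```

Every stratum is guaranteed to appear in the sample, which is what protects small subgroups from being missed.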

Cluster sampling

Divide the population into clusters, then randomly select entire clusters.

Mechanics

  • Elements within a cluster should be heterogeneous (opposite of stratified)
  • Example: geographic areas (city, zip)

Figure: a population divided into Clusters 1 through 6.

Advantages

  • More cost-effective than stratified sampling
  • Only need a sampling frame for selected clusters

Disadvantages

  • Less representative than stratified sampling
  • Higher sampling error
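A minimal sketch of one-stage cluster sampling, in which whole clusters are selected at random and every element of a selected cluster is kept. The zip-code clusters and household IDs are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Randomly select entire clusters and keep every element in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

rng = random.Random(440)
# Hypothetical population: 12 zip codes with 40 households each.
clusters = {
    f"zip_{z:02d}": [f"zip_{z:02d}_household_{i}" for i in range(40)]
    for z in range(12)
}

clustered = cluster_sample(clusters, n_clusters=3, rng=rng)
print(sorted(clustered))
```

Note that only the three selected zip codes need a household list, which is where the cost saving comes from.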

Sampling people from different cities

  • Goal: Sample 10,000 people from these 50 cities
  • How should we approach this sampling problem?

Cities - SRS of people

  • With SRS, each person has the same probability of being selected
  • Larger cities contribute more people to the sample simply because they have more people
  • Big cities dominate; small cities may be barely represented
  • Not ideal if we care about city-level insights

Cities - stratified (SRS within strata)

  • If we want a representative sample from each city, we might think of each city as a stratum
  • Then conduct an SRS of 200 participants within each city

Cities - cluster

  • If costs are a concern, we might think of each city as a cluster
  • First, sample (SRS) 20 cities from the list of 50
  • Then, sample (SRS) 500 participants within each city
  • When is this approach as good as (or worse than) the stratified approach?
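The two steps of the cluster approach for the cities example can be sketched as below. The city names and the 5,000 residents per city are invented so the code is self-contained; real cities would differ in size.

```python
import random

rng = random.Random(440)

# Hypothetical frame: 50 cities, each with an invented list of 5,000 residents.
cities = {
    f"city_{c:02d}": [f"city_{c:02d}_person_{i}" for i in range(5_000)]
    for c in range(50)
}

# Stage 1: SRS of 20 cities (the clusters).
chosen_cities = rng.sample(sorted(cities), 20)

# Stage 2: SRS of 500 residents within each chosen city.
two_stage = {city: rng.sample(cities[city], 500) for city in chosen_cities}

total_people = sum(len(people) for people in two_stage.values())
print(len(two_stage), "cities,", total_people, "people")
```

The 30 cities that are not drawn in stage 1 contribute nobody, so the approach works well only if the cities that are drawn resemble the ones that are not.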

Cluster vs stratified summary

Cluster sampling

  • Sample whole clusters (groups)
  • Elements are different within a cluster, while compositions of clusters are similar
  • A sampling frame is needed only for the clusters selected for the sample
  • More efficient

Stratified sampling

  • Sample within specific strata
  • Within a stratum, elements are homogeneous with clear differences (heterogeneity) between the strata
  • A complete sampling frame is needed for every stratum
  • Can be more representative

Cost Comparison: If plane tickets + hotels cost $1,000 per city visited, and surveying a person costs $2…

  • Stratified cost = \(50 \times \$1{,}000 + 10{,}000 \times \$2 = \$70{,}000\)
  • Cluster cost = \(20 \times \$1{,}000 + 10{,}000 \times \$2 = \$40{,}000\)

Multistage sampling example

  • We are interested in surveying adults in US cities.

  • We want to make sure we have equal coverage across all cities in addition to a representative sample of many different income levels and ethnicities in our data.

  • How can we do this by combining stratified and cluster sampling?

Multistage sampling solution

  • Stage 1: Geographical clusters (metropolitan area and rural area)

  • Stage 2: Ethnic and income strata within each geographical cluster

  • Data are obtained by taking an SRS for each sub-stratum (sub-cluster).

Practice

A marketing research firm is hired by a fashion brand to estimate the average monthly spending on clothing by online shoppers.

  • How might you obtain a sampling frame?
  • What are two potential limitations of your sampling frame?
  • How might you generate a simple random sample?
  • How might you generate a systematic random sample?
  • How might you generate a stratified sample?
  • How might you generate a clustered sample?
  • Suppose the brand is particularly interested in a customer segment that makes up a small proportion of your sampling frame. Which method(s) will ensure the most useful sample?

Non-probability sampling

Non-probability sampling techniques

  • Whether an element enters the sample or not depends on researcher’s judgement
  • Sometimes the best we can do
  • But, no guarantee that your sample will be representative

Convenience sampling (accidental sampling)

Convenience sampling: A sample of convenient elements

  • They happen to be in the right place at the right time
  • Examples: students in the front row, “people-on-the-street” interview, mall intercept interview

Advantages

  • Least expensive
  • Least time consuming

Disadvantages

  • Not representative – may not be able to generalize findings

Judgmental sampling (purposive sampling)

Judgmental sampling: The elements/subjects are selected based on the judgment of the researcher

  • Examples: expert witnesses in court, subjects in a focus group
  • Popular for qualitative studies

Advantages

  • Low cost and still less time-consuming than most other techniques

Disadvantages

  • Results depend on the judgement of the researcher
  • No guarantee for a representative sample

Quota sampling

Quota sampling: A two-step restricted form of judgmental sampling

  1. Develop quotas for each control characteristic
  2. Use convenience or judgmental sampling to fulfill each quota

| Control characteristic: Gender | Population % | Quota sample |
|---|---|---|
| Male | 41% | 410 |
| Female | 55% | 550 |
| Non-binary/Other | 4% | 40 |
| Total | 100% | 1,000 |
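The two steps can be sketched in Python using the percentages from the table above. The `fill_quotas` helper and the stream of arriving respondents are hypothetical; they stand in for whatever convenience or judgmental process supplies participants.

```python
# Step 1: turn population percentages into quotas for a sample of 1,000.
population_pct = {"Male": 41, "Female": 55, "Non-binary/Other": 4}
sample_size = 1_000
quotas = {group: pct * sample_size // 100 for group, pct in population_pct.items()}
print(quotas)

# Step 2: accept respondents as they arrive until each quota is full.
def fill_quotas(arrivals, quotas):
    remaining = dict(quotas)
    accepted = []
    for person, group in arrivals:
        if remaining.get(group, 0) > 0:
            accepted.append(person)
            remaining[group] -= 1
    return accepted, remaining

# Tiny invented example: quotas of 1 and 2, with four people walking by.
arrivals = [("p1", "Male"), ("p2", "Male"), ("p3", "Female"), ("p4", "Female")]
accepted, remaining = fill_quotas(arrivals, {"Male": 1, "Female": 2})
print(accepted, remaining)
```

The quotas match the population on the control characteristic, but who gets accepted within each group still depends on who happens to show up first.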

Advantages

  • Ensures representation of key subgroups

Disadvantages

  • Results still depend on judgement
  • Still no guarantee of a representative sample

Snowball sampling (referral sampling)

Snowball sampling: Start with a few participants and ask them to refer additional participants via their social networks.

  • Useful for hard-to-reach populations (e.g., drug users, undocumented immigrants)

Advantages

  • Can reach populations where a sampling frame is difficult to obtain

Disadvantages

  • Difficult to obtain the initial sample
  • Community bias – no guarantee for representative sample

Coverage issues

U.S. Census

  • The U.S. Census is described as going “door-to-door” to count every person living in the U.S. on April 1st of the census year
  • Who is left out? People experiencing homelessness, people who are undocumented, and people who don’t want to be counted.

Clinical trials

  • Historically, many drug trials (especially early-phase) excluded women of childbearing potential
  • One driving reason was concern about potential harm to a fetus
  • However, this is clearly problematic if the drug is intended for use in women, who may experience different side effects

Walmart survey case study

Walmart in Soledad, CA (2008)

  • 2008: Wal-Mart wants to open a store in Soledad, CA
  • Objective: “measure” community support for a new store
    • Constructed a survey
    • Mailed to 5,100 Soledad residents

The survey letter

PS: I have included a card to indicate your support for a new Wal-Mart in Soledad. Please take a moment to fill out the card and mail it back to us. Your opinion is important to us.

The survey card

Survey results

Results from the survey:

  • 95% → good for economy
  • 95% → they would shop
  • 87% → they know someone who’d apply for a job
  • 90% → they would support a Wal-Mart

Four months later…

What went wrong in Walmart’s survey?

  • Bias exists if response is NOT independent of outcome variable

  • Measured Effect = True Effect + Bias

  • What specific kind of bias here?
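The decomposition "Measured Effect = True Effect + Bias" can be illustrated with a small calculation in which the decision to respond depends on the respondent's opinion. All rates below are invented for illustration; they are not estimates from the Soledad survey.

```python
def measured_support(true_support, response_rate_supporters, response_rate_opponents):
    """Share of returned cards that express support."""
    supporters_responding = true_support * response_rate_supporters
    opponents_responding = (1 - true_support) * response_rate_opponents
    return supporters_responding / (supporters_responding + opponents_responding)

true_support = 0.50  # hypothetical: half the town supports the store

# Response independent of opinion: no bias.
unbiased = measured_support(true_support, 0.20, 0.20)

# Supporters ten times as likely to mail the card back: large upward bias.
biased = measured_support(true_support, 0.30, 0.03)

print(f"independent response: measured = {unbiased:.2f}, bias = {unbiased - true_support:+.2f}")
print(f"dependent response:   measured = {biased:.2f}, bias = {biased - true_support:+.2f}")
```

When response rates are equal across opinions, the measured share equals the true share; the bias appears only once responding and the outcome are linked.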

Other examples?

Have you come across other examples of possibly biased sampling?

Why does sampling work?

Central limit theorem

  • [IF] the sample size is sufficiently large (rule of thumb: \(n \geq 30\))
  • [THEN] regardless of the shape of the population distribution, the sampling distribution of the sample mean will be approximately normal in shape.
  • We use the CLT to develop statistical tests and quantify uncertainty in our sample estimates
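A small simulation can make the CLT concrete. The sketch below assumes a strongly skewed population (exponential with mean 1, chosen only for illustration) and repeatedly draws samples of size 30, recording each sample mean.

```python
import math
import random
import statistics

def simulate_sample_means(n, reps, rng):
    """Draw `reps` samples of size n from a skewed population; return their means."""
    means = []
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]  # exponential, mean 1
        means.append(statistics.fmean(sample))
    return means

rng = random.Random(440)
means = simulate_sample_means(n=30, reps=2_000, rng=rng)

print(f"mean of sample means:  {statistics.fmean(means):.3f}  (population mean is 1)")
print(f"stdev of sample means: {statistics.stdev(means):.3f}  (theory: {1 / math.sqrt(30):.3f})")
```

A histogram of `means` should look roughly bell-shaped and centered near 1, even though the individual draws are far from normal.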

Non-marketing example: Loaded dice

  • Suppose I roll a die 50 times, making note of the roll every time
  • The average value of those 50 rolls is 4.1
    • We know that the average value of a fair die roll is 3.5
  • Is this a “fair” die?
    • Is 4.1 different enough from 3.5?

We can answer this with sampling

  • Simulate the outcome of our sampling procedure (50 dice rolls) given what we know about fair dice rolls (the population)
  • An average of 4.1 appears unlikely (though not impossible)
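The simulation described above can be sketched as follows: roll a fair die 50 times, record the average, and repeat many times to see how often an average of 4.1 or more shows up by chance. The 10,000 repetitions and the seed are arbitrary choices.

```python
import random
import statistics

def mean_of_rolls(n_rolls, rng):
    """Average of n_rolls rolls of a fair six-sided die."""
    return statistics.fmean([rng.randint(1, 6) for _ in range(n_rolls)])

rng = random.Random(440)
simulated_means = [mean_of_rolls(50, rng) for _ in range(10_000)]

# Share of simulated fair-die experiments with an average at least as large as 4.1.
share_extreme = sum(m >= 4.1 for m in simulated_means) / len(simulated_means)

print(f"average of simulated means: {statistics.fmean(simulated_means):.2f}")
print(f"share with mean >= 4.1:     {share_extreme:.4f}")
```

If that share is small, a fair die rarely produces an average this high, which is evidence (not proof) that the die is loaded.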

What size should our sample be?

Factors that determine sample sizes with probability sampling:

  • Population variance (\(\sigma^2\)): Is there a lot of variation among subjects within the population?
    • Have to estimate this (pilot survey or use prior knowledge, e.g., last year’s survey results)
  • Level of confidence desired in the estimate (\(Z\))
  • Acceptable margin of error in the sample estimate (\(e\)): smaller \(e\) means the sample estimate is more precise.

\[n = \frac{Z^2 \times \sigma^2}{e^2}\]

Sample size for a survey

Goal: Estimate the average satisfaction score (0-10 scale)

  • Population variance (\(\sigma^2\)): \(\sigma = 2.5 \Rightarrow \sigma^2 = 6.25\)
  • 95% confidence \(\Rightarrow Z = 1.96\)
  • Acceptable margin of error: \(e = 0.50\) points (we want “\(\pm 0.50\)” precision)

\[ n=\frac{Z^2\sigma^2}{e^2} =\frac{(1.96)^2(6.25)}{(0.50)^2} = 96.04 \]

Survey at least \(n=97\) customers to estimate the mean satisfaction score within \(\pm 0.50\) points at 95% confidence.
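The same calculation as a small Python function, rounding up because we cannot survey a fraction of a person. The function name is our own; the inputs are the ones from the example.

```python
import math

def required_sample_size(z: float, sigma: float, e: float) -> int:
    """n = Z^2 * sigma^2 / e^2, rounded up to the next whole respondent."""
    return math.ceil((z ** 2) * (sigma ** 2) / (e ** 2))

n = required_sample_size(z=1.96, sigma=2.5, e=0.50)
n_tighter = required_sample_size(z=1.96, sigma=2.5, e=0.25)

print("e = 0.50:", n)
print("e = 0.25:", n_tighter)
```

Halving the margin of error roughly quadruples the required sample size, because \(e\) enters the formula squared.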